Abstract

The problem of clustering large complex networks plays a key role in several scientific fields ranging from Biology
to Sociology and Computer Science. Many approaches to clustering complex networks are based on the idea of
maximizing a network modularity function. Some of these approaches can be classified as global because they exploit
knowledge about the whole network topology to find clusters. Other approaches, instead, can be interpreted as local
because they require only a partial knowledge of the network topology, e.g., the neighbors of a vertex. Global
approaches are able to achieve high values of modularity but they do not scale well on large networks and, therefore,
they cannot be applied to analyze on-line social networks like Facebook or YouTube. In contrast, local approaches
are fast and scale up to large, real-life networks, at the cost of poorer results than those achieved by global methods.

In this article we propose a glocal method for maximizing modularity, i.e., our method uses information at the global
level, yet its scalability on large networks is comparable to that of local methods.

The proposed method is called COmplex Network CLUster DEtection (CONCLUDE for short). It works in two
stages. In the first stage, it uses an information-propagation model, based on random, non-backtracking walks of
finite length, to compute the importance of each edge in keeping the network connected (its edge centrality). Edge
centrality is then used to map network vertices onto points of a Euclidean space and to compute distances between
all pairs of connected vertices. In the second stage, CONCLUDE uses the distances computed in the first stage to
partition the network into clusters.

CONCLUDE is computationally efficient since in the average case its cost is roughly linear in the number of edges of
the network. Tests on diverse benchmark datasets show good results in terms of both running time and clustering
quality, for synthetic networks with well-defined clusters as well as for real-world network instances.

1. Introduction

In a network, the term community structure indicates the presence of groups of vertices, called communities or
clusters, such that a large number of edges connect vertices inside the same community and few edges link
vertices located in different clusters [11]. For a given network, represented by a graph $G = (V, E)$ where V is the set
of vertices and E the set of edges, the community detection problem consists of finding a partition of the vertices in V
of the form $C = \{C_1, \ldots, C_q\}$ such that each $C_i$, $1 \le i \le q$, exhibits the community structure described above.

The detection of clusters in large graphs is becoming a central research problem in a variety of areas including
VLSI design, parallel computing, computer vision and social network analysis [17]. This interest stems from the
fact that many real-world systems consist of separate modules that interact with each other. If modules are clearly
identified, it is then possible to understand the role each module plays in the overall behavioral dynamics of the
system, and to study how potential changes in the structure and functions of a module would impact the overall
system. As an example, in the biological domain community detection algorithms have been deployed to clarify the
functioning of metabolic networks [26] or to understand how some proteins interact in small groups (or subsystems)
called 'modules', or form so-called complexes [46]. In Computer Science and Sociology, community detection
algorithms have been exploited to understand the social structures arising from the interactions of single individuals;
this has relevant practical applications, for instance in customer segmentation.

Early methods for finding communities were rooted in graph theory, e.g., the principle of maximum flow and minimum 
cut [11]. The main limitation of these methods is their high computational cost, which restricts their applicability to
toy models or small instances (e.g., samples) of real-world networks. 

Recently, increasing attention has been paid to spectral clustering methods [28, 40, 44, 48], which strive to optimize
suitable cost functions. Empirical analysis provides evidence that spectral clustering methods achieve excellent
performance in some domains, e.g., image segmentation. Most spectral clustering methods are parametrized by
the desired number of clusters, denoted by an integer k, and in many cases they tend to produce
clusters of (almost) equal size. In a broad range of application domains, however, these features are considered
serious drawbacks, especially since there is often no information available to tune k correctly. In addition, the
identification of equally sized clusters contradicts some well-known facts about real social networks: the size of
communities varies greatly, ranging from a few communities gathering a large number of individuals to many communities
containing only a few participants [8, 9, 41].

A breakthrough in community detection has been the introduction of a cost function called network modularity (or, in
short, modularity, usually denoted as Q); it is based on the edge density in a graph for any candidate partition. In
recent years, several algorithms that try to maximize Q have been designed; see, e.g., [4, 8, 14, 37].

Empirical studies carried out by Newman and Girvan [37, 39] on a wide variety of artificial and real networks
highlighted a correlation between high values of modularity and the actual community structure of a network. As a further
empirical result, Newman and Girvan showed that real-world networks endowed with a clear community structure
have their Q measure ranging from 0.3 to 0.7. This means that maximizing Q is instrumental to disclosing
the community structure of a network, and this is perhaps the main reason for the widespread adoption and success
of approaches that maximize modularity.

Unfortunately, the maximization of Q poses two main challenges. First of all, the Q function is defined over all
the vertices of the graph G and, therefore, to maximize it we should take into account the whole network topology.
Approaches based on the knowledge of the whole network structure are defined as global approaches. Among them,
we cite the Girvan-Newman algorithm [21] (which is, to the best of our knowledge, the first algorithm that attempts
to maximize Q), the method based on information centrality [19] and several others (please see [17] for a survey).
The worst-case time complexity of global approaches is, unfortunately, very high. Thus, these strategies cannot be
successfully applied to very large networks comprising millions of vertices and billions of edges.

In contrast, another family of approaches considers local information, like the knowledge of the neighbors of each
vertex, to perform graph clustering. These are called local approaches and one of the best-known is the Louvain
Method (LM). In general, these approaches use greedy strategies to maximize Q and, therefore, they may get trapped in
local optima and, ultimately, fail to discover the actual community structure of a graph.

Second, the optimization of modularity has a major drawback called the resolution limit [18], i.e., the impossibility of
finding communities of small size under certain circumstances, which typically depend on the topology of the network.
Several authors have proposed ad-hoc solutions to alleviate the resolution limit problem, such as providing novel
definitions of modularity [32] or adding weights to the edges [3].

In this article we define a methodology for clustering graphs which couples the accuracy of global approaches
with the computational performance guaranteed by local methods and, at the same time, mitigates the resolution limit
effect. The proposed approach is called COmplex Network CLUster DEtection or, for short, CONCLUDE.

CONCLUDE works in two stages: in the first one, it maps graph vertices onto points of a Euclidean space and
computes the pairwise distances between them by exploiting a Euclidean-like distance metric. In the second stage, it
adopts distances between points as a guide to perform clustering.

In the first phase, in order to map graph vertices onto points, CONCLUDE relies on the concept of κ-path
centrality [11, 12]. The κ-path edge centrality of an edge is defined as the probability that the edge is selected to spread
information in the network; such probability is computed by a suitable process of information diffusion simulated
by the algorithm over the network itself. The concept of κ-path edge centrality fits quite well with our intuitive,
experience-based notion of community: a group of individuals who frequently exchange information with each other,
while information leaving the community should be regarded as an unlikely event.
Finding edges with high centrality is equivalent to disclosing preferential channels along which information flows and,
ultimately, this is useful to reveal communities of individuals.

The definition of κ-path centrality is based on an information propagation model in which we assume that a message
(representing a basic piece of information) is injected at an arbitrary vertex of the network and flows along random,
non-backtracking paths of length up to κ. This model is justified by the fact that all users are treated
equally and, therefore, each user is allowed to generate and spread a message. In addition, each user has only a partial
knowledge of the network topology and, in particular, she is only aware of her neighborhood. For these reasons, the
propagation path followed by a message is unpredictable: each user decides on her own to whom
a message is forwarded and, therefore, such a path can be regarded as random.

The other assumption is that paths are non-backtracking. In fact, we assume that a user wants to maximize the
number of persons getting the message and, therefore, she has no interest in contacting her sources. Finally, since
distant users are not likely to influence each other, propagation paths are also required to be of bounded length.

The computation of the κ-path edge centrality of an arbitrary edge is computationally demanding since it should consider
all paths containing the given edge and, in principle, there could be exponentially many such paths (in the number
of vertices of the graph). We propose here a heuristic algorithm, called ERW-Kpath (Edge Random Walk κ-path
Centrality), to efficiently compute edge-centrality values.

Once κ-path centrality values have been computed, CONCLUDE proceeds to compute the distance between each pair
of connected vertices. The definition of distance is based on the principle of structural equivalence [47]: two vertices i and j
are considered close if their neighbors are close too. In particular, a vertex k which is a neighbor of both i and j is
assumed to be close to both i and j if the probability that a message flows from k to i is comparable to the probability
that it flows from k to j. If these probabilities are large enough, then a message received by k will be forwarded, with
high probability, to both i and j. Vice versa, if these probabilities are small enough, then a message received by k is
not likely to be forwarded to i and j. The probability that a message flows from the vertex k to the vertex i (resp., j)
coincides with the centrality of the edge linking k to i (resp., to j).
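
To make the structural-equivalence idea concrete, the sketch below shows one way such a distance could be computed from edge centralities. The specific formula (a Euclidean-style sum of squared centrality differences over every third vertex k, each term scaled by the degree of k) is an assumption of this sketch and is not stated in the text above; the names `centrality_distance`, `adj` and `cent` are hypothetical.

```python
import math


def centrality_distance(adj, cent, i, j):
    """Euclidean-style distance between vertices i and j (illustrative only).

    adj  : dict mapping each vertex to the set of its neighbors
    cent : dict mapping frozenset({u, v}) to the centrality of edge (u, v)
    A missing edge is treated as having centrality 0 (an assumption).
    """
    total = 0.0
    for k in adj:
        if k == i or k == j or not adj[k]:
            continue
        c_ki = cent.get(frozenset((k, i)), 0.0)
        c_kj = cent.get(frozenset((k, j)), 0.0)
        # k keeps i and j close when it forwards to both with similar probability
        total += (c_ki - c_kj) ** 2 / len(adj[k])
    return math.sqrt(total)


# Toy graph: c is a shared neighbor of a and b, d is attached to a only.
adj = {"a": {"c", "d"}, "b": {"c"}, "c": {"a", "b"}, "d": {"a"}}
cent = {
    frozenset(("a", "c")): 0.5,
    frozenset(("b", "c")): 0.5,
    frozenset(("a", "d")): 0.4,
}
d_ab = centrality_distance(adj, cent, "a", "b")
```

In this toy input the shared neighbor c contributes nothing to the distance, because it forwards to a and b with the same centrality; only d, which reaches a but not b, pushes the two vertices apart.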

After the mapping has been performed and weights are in place, in principle we can follow it up with any off-the-shelf 
local algorithm for detecting communities in graphs. In this article we discuss clustering by the Louvain Method 
described above, as we found it particularly suitable to be embedded into our CONCLUDE algorithm. 

According to the discussion above, we can notice that CONCLUDE advances the state of the art in a number of 
directions: 

• In order to compute the edge centrality values, we use non-backtracking random walks of finite length κ. The
value of κ can be fixed arbitrarily and, for large values of κ (e.g., for κ of the same order of magnitude as
the graph diameter), our walks allow us to explore portions of the graph that are quite far from each other. In
this respect, the κ-path edge centrality is to be regarded as information at the global level because the
horizon of the walker encompasses the whole graph. Interestingly, the simulation of such random walks only requires
selecting the starting vertex at random and, for each visited vertex, knowing the neighbors
of that vertex. In this way, the walker is not required to know the whole graph topology in advance nor to store
it.

• Our analysis shows that the worst-case time complexity of CONCLUDE is near linear in the number of edges
in the graph. This makes our approach computationally competitive, with performance comparable to
that of local algorithms.

• According to our definition of distance, given three (or possibly more) vertices i, j and k, these vertices are
mapped onto close points if, when one of the three receives a message, such a message is conveyed with
high probability to the other two vertices. In this way, a group of vertices that forms a community is mapped
onto a dense region of a Euclidean space. This ultimately makes the process of identifying communities more
effective.

• In the computation of distances, we use edge centralities as weights on edges. As outlined above, weighting
edges is, in general, beneficial to reduce the resolution limit problem. In a previous work [11], we
showed that using edge centrality in conjunction with some clustering algorithms raises the accuracy of
these algorithms compared with that obtained without weights on edges. This article, however,
differs from our previous work: in [11] we used a weighting procedure on edges to improve the accuracy
of the clustering process, whereas here we use edge centrality to compute distances between pairs of
vertices and, as a final outcome, to map graph vertices onto points of a Euclidean space. Such a mapping
is interesting per se because it enables a wide range of complex analysis tasks (like finding the k-nearest
neighbors of a vertex).

Performance and scalability of CONCLUDE have been assessed through experiments and benchmarking both real 
and artificial network datasets. In particular, we considered 6 real-world datasets where the largest, a sample from 
Facebook, consists of 63,731 vertices and 1,545,684 edges. We compared the modularity achieved by CONCLUDE 
clustering against the results of well-known algorithms such as the Louvain Method alone, COPRA [25] and OSLOM
[30]. As for synthetic (artificially generated) networks, we used the LFR benchmark [29] to generate 72 networks
whose community structure was known in advance. We compared communities found by CONCLUDE with the 
actual ones by using the so-called Normalized Mutual Information measure from Information Theory lOj. Finally, 
to make the algorithm available to the research community, an implementation of CONCLUDE can now be freely
obtained.

The outline of the article is as follows: in Section 2 we provide a detailed discussion of the main methods in the literature
to find communities in networks based on the principle of modularity maximization. In Section 3 we provide the
definition of κ-path centrality and describe our ERW-Kpath algorithm. The main features of CONCLUDE are covered
in Section 4, whereas an in-depth experimental analysis of CONCLUDE performance is discussed in Section 5. Finally,
in Section 6 we draw our conclusions.



2. Background 



In this section we first describe approaches to finding communities based on the principle of maximizing the network
modularity [21] (Section 2.1). Next, we discuss the shortcomings associated with the maximization of this function
and illustrate some strategies to cope with them (Section 2.2).



2.1. Finding Communities by Maximizing Network Modularity 

The concept of network modularity or, in short, modularity was introduced by Girvan and Newman [21]. Network
modularity is usually denoted as Q and it is based on the idea that a random graph is not expected to exhibit a
community structure. Therefore, for a given graph $G = (V, E)$ and a partition (or clustering) $C = \{C_1, \ldots, C_q\}$ of the
vertices of G, a random graph G' is built by copying each vertex of G onto a vertex in G'. Thus, for each vertex i in
G, there exists a homonymous vertex in G'. Because each vertex in V uniquely corresponds to a vertex in V' and vice
versa, for each vertex $i \in V$ we shall denote as $i' \in V'$ the unique vertex corresponding to i. In G' an edge is drawn
between a pair of vertices according to a uniform probability and, in addition, the degree sequence in G' matches the
actual degree sequence of G. Therefore, if a vertex i has degree $d_i$ in G, then i' also has degree $d_i$. Furthermore, if the
total number of edges in G is m, then the total number of edges in G' will also be m. Finally, the expected number of
edges linking two vertices i' and j' is proportional to the product of their degrees and, in particular, it is equal to $\frac{d_i d_j}{2m}$.
The graph G' is called the null model for G.

For a given subgraph $C_i \in C$, we can select one by one the vertices forming $C_i$ and consider, for each selected vertex,
the corresponding vertex in G'. This identifies a subgraph $C_i'$ in G' corresponding to $C_i$. The subgraph $C_i$ is
classified as a community if the density of internal edges in $C_i$ is significantly greater than in $C_i'$.

The Q function formally encodes the previous reasoning, i.e., for each community $C_i$, it checks if an edge in $C_i$ exists
also in $C_i'$. This leads to the following formula

$$Q = \frac{1}{2m} \sum_{i, j \in V} \left( A_{ij} - \frac{d_i d_j}{2m} \right) \delta\left(C^{(i)}, C^{(j)}\right) \tag{1}$$

Here $A_{ij} = 1$ if vertices i and j in G are connected by an edge and 0 otherwise, and $C^{(i)}$ (resp., $C^{(j)}$) is the community
containing the vertex i (resp., j). Here $\delta(\cdot, \cdot)$ denotes the Kronecker function; two vertices i and j provide a non-zero contribution
to the value of Q if and only if they belong to the same community.
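
As an illustration of how Q can be evaluated for a given partition, the following minimal sketch applies the standard double-sum form of modularity directly to a small unweighted, undirected graph. The function name `modularity` and the input layout (an edge list plus a vertex-to-community dictionary) are choices made for this example only; the quadratic loop is meant for clarity, not for large graphs.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted simple graph.

    edges     : list of (u, v) pairs, each undirected edge listed once
    community : dict mapping every vertex to its community label
    """
    m = len(edges)
    degree = defaultdict(int)
    adjacent = set()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        adjacent.add((u, v))
        adjacent.add((v, u))

    q = 0.0
    for i in degree:
        for j in degree:
            if community[i] != community[j]:
                continue  # Kronecker delta is 0 for vertices in different communities
            a_ij = 1.0 if (i, j) in adjacent else 0.0
            q += a_ij - degree[i] * degree[j] / (2.0 * m)
    return q / (2.0 * m)


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
by_triangle = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
all_together = {v: "a" for v in range(6)}

q_split = modularity(edges, by_triangle)    # 5/14, roughly 0.357
q_merged = modularity(edges, all_together)  # 0.0
```

Splitting the graph along the bridge yields a clearly positive Q, whereas placing every vertex in a single community yields Q = 0, because the observed and expected edge counts then cancel exactly.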

The problem of maximizing Q has been proved to be NP-hard by Brandes et al. The first non-trivial
approximability results beyond NP-hardness were proposed by Das, Gupta and Desai [10]. They studied dense graphs
separately from sparse ones and their main result proves the (1 + ε)-inapproximability of Q in the case of dense
graphs and a logarithmic approximation in the case of sparse graphs.

Several heuristic strategies to maximize the network modularity Q have been proposed to date. Probably the most
popular one is known as the Girvan-Newman strategy [21, 39].

In Girvan-Newman, edges are ranked by using a parameter known as edge betweenness centrality. The edge
betweenness centrality $B(e_{ij})$ of a given edge $e_{ij} \in E$ connecting the vertex i with the vertex j is defined as

$$B(e_{ij}) = \sum_{l \in V} \sum_{m \in V} \frac{np(l, m, e_{ij})}{np(l, m)}$$

where l and m are arbitrary vertices in V, $np(l, m)$ is the number of shortest paths connecting l and m, and $np(l, m, e_{ij})$
is the number of shortest paths between l and m containing $e_{ij}$.
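
The definition can be checked on small graphs with a brute-force sketch such as the one below. It assumes an unweighted, undirected graph given as an adjacency dictionary and sums over ordered pairs of distinct vertices; the function names are hypothetical, and the approach is deliberately naive (it is not one of the efficient algorithms used in practice).

```python
from collections import deque


def shortest_path_counts(adj, source):
    """BFS from source: hop distances and number of shortest paths to each vertex."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma


def edge_betweenness(adj):
    """Edge betweenness, summed over ordered pairs (src, dst) with src != dst."""
    info = {s: shortest_path_counts(adj, s) for s in adj}
    result = {}
    for i in adj:
        for j in adj[i]:
            edge = frozenset((i, j))
            if edge in result:
                continue
            total = 0.0
            for src in adj:
                dist_s, sigma_s = info[src]
                for dst in dist_s:  # only vertices reachable from src
                    if dst == src:
                        continue
                    dist_d, sigma_d = info[dst]
                    through = 0
                    # the edge may be crossed as i -> j or as j -> i
                    for a, b in ((i, j), (j, i)):
                        if (a in dist_s and b in dist_d
                                and dist_s[a] + 1 + dist_d[b] == dist_s[dst]):
                            through += sigma_s[a] * sigma_d[b]
                    total += through / sigma_s[dst]
            result[edge] = total
    return result


path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
bw_path = edge_betweenness(path)
bw_cycle = edge_betweenness(cycle)
```

On the three-vertex path every edge lies on the shortest paths of four ordered pairs. On the four-cycle each edge carries two adjacent pairs in full and half of the paths of four diagonal pairs, which again gives 4.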

It is possible to maximize Q by progressively deleting edges with the highest value of betweenness centrality, based on
the consideration that they are likely to connect vertices belonging to different communities [39]. The process iterates until
a significant increase of Q is obtained. At each iteration, each connected component of G identifies a community.
Unfortunately, the computational cost of this strategy is $O(|V|^3)$ and this makes it unsuitable for the analysis of large
graphs. The most time-expensive part of the Girvan-Newman strategy is the calculation of the betweenness centrality.
Efficient algorithms have been designed to approximate the edge betweenness [5], or to efficiently compute shortest
paths, for example in the context of weighted graphs [43]. For real-world graphs, however, the computational cost
still remains prohibitive.

Several variants of this strategy have been proposed over the years, such as the Fast Clustering Algorithm provided
by Clauset, Newman and Moore [8], that runs in $O(|V| \log^2 |V|)$ on sparse graphs. In [14], Duch and Arenas
proposed the extremal optimization method based on a fast agglomerative approach whose worst-case time complexity is
$O(|V|^2 \log |V|)$.

An interesting network modularity maximization strategy is provided in the so-called Louvain method (LM) [4], which
will be extensively described in Section 4.2.



The approaches mentioned above use greedy strategies to maximize Q. In fl^ the authors propose to use simulated 
annealing to maximize Q. This approach achieves a high accuracy but can as well be computationally very expensive. 
The main advantage of simulated annealing is that it does not rely on a greedy strategy and is therefore less likely to
become stuck in local optima. Starting from a randomized partition and an arbitrary parameter
that simulates the temperature of the system, the simulated annealing algorithm computes the energy of the current
configuration (say $E_a$), according to a suitable function. After some changes in the partitioning are applied,
the algorithm recomputes the energy level of the new configuration (say $E_b$) and accepts the new configuration if
its energy is lower than the former one. Interestingly, a worse configuration may also be accepted, with
a probability $\Pr(T)$ that depends on the temperature T of the system and is computed according to the Boltzmann factor
$\Pr(T) = e^{-\Delta E_{ab}/T}$, where $\Delta E_{ab} = E_b - E_a$. Then, the algorithm lowers the temperature and iterates the process. The
lower the temperature, the lower the chance of accepting non-optimal solutions. This means that in the early stages
the algorithm tends to accept non-optimal solutions more frequently, while in the late stages the optimal solution should
emerge. This helps the algorithm avoid getting stuck in local optima, but contributes to its high computational cost.
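
The acceptance rule described above can be written in a few lines. This is a minimal sketch of the generic Metropolis-style criterion, not of any specific published implementation; the function names are hypothetical. For modularity maximization one would typically take the energy to be −Q, which is an assumption of this example.

```python
import math
import random


def acceptance_probability(e_a, e_b, temperature):
    """Probability of moving from energy e_a to energy e_b at a given temperature."""
    delta = e_b - e_a
    if delta <= 0:
        return 1.0  # a configuration with lower (or equal) energy is always accepted
    return math.exp(-delta / temperature)  # Boltzmann factor for a worse configuration


def accept(e_a, e_b, temperature, rng=random):
    """Randomly decide whether to accept the new configuration."""
    return rng.random() < acceptance_probability(e_a, e_b, temperature)


# The same worsening step is accepted far less often once the system has cooled.
p_hot = acceptance_probability(0.0, 1.0, temperature=10.0)
p_cold = acceptance_probability(0.0, 1.0, temperature=0.1)
```

With these inputs `p_hot` is close to 0.9 while `p_cold` is below 0.0001, which illustrates why worse solutions are accepted mostly in the early, high-temperature stage.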

2.2. Limitations of Network Modularity 

In this section we discuss some limitations of the approaches that attempt to optimize Q. A first problem of modularity
depends on the fact that there exists an exponential number of partitions of a graph whose modularity values are close
to each other and also close to the global maximum of Q [23]. On the one hand, such a result
explains why methods which are in principle very different from each other generate graph clusterings whose modularity
values are quite close. On the other hand, different algorithms could produce clusterings which differ significantly
from each other and yet have similar modularity scores. The main consequence is that none
of these clusterings appears to be better than the others, unless we have additional information explaining how
communities are organized.

The second (and perhaps most important) problem is known as the resolution limit [18]. Informally, the resolution limit
problem consists in the fact that, if communities of small size exist in the original graph, the value
of Q may increase if we merge these communities into a larger one. Merging these communities is potentially
wrong because, in order to pursue the maximization of Q, we would ignore small communities that are possibly relevant in
the community structure of a graph. From a quantitative standpoint, the resolution limit problem arises whenever in a
graph G there exists a community C such that the sum of the degrees of vertices inside C is less than $\sqrt{m}$, where m is the
number of edges in G [18].
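
A classic way to observe this effect is a ring of small cliques, each joined to the next by a single edge. The sketch below is an illustrative construction, not an experiment from this article. It compares the modularity of the natural partition (one community per clique) with that of a partition that merges adjacent cliques in pairs, using the per-community form $Q = \sum_c \left[ l_c/m - (d_c/2m)^2 \right]$, where $l_c$ is the number of internal edges and $d_c$ the sum of degrees of community c. This form is an algebraic rearrangement of the standard modularity sum; the function names are hypothetical.

```python
def modularity_from_counts(communities, m):
    """Q from per-community (internal_edges, degree_sum) pairs; m is the total edge count."""
    return sum(l_c / m - (d_c / (2.0 * m)) ** 2 for l_c, d_c in communities)


def ring_of_cliques(n_cliques, clique_size):
    """Return (Q with one community per clique, Q with adjacent cliques merged in pairs).

    n_cliques is assumed to be even so that cliques can be paired up.
    """
    internal = clique_size * (clique_size - 1) // 2
    m = n_cliques * (internal + 1)  # each clique adds one edge of the ring
    degree_sum = 2 * internal + 2   # two ring edges touch every clique
    q_single = modularity_from_counts([(internal, degree_sum)] * n_cliques, m)
    # a merged pair keeps both sets of internal edges plus the edge joining them
    merged = (2 * internal + 1, 2 * degree_sum)
    q_pairs = modularity_from_counts([merged] * (n_cliques // 2), m)
    return q_single, q_pairs


q_single_big, q_pairs_big = ring_of_cliques(30, 3)      # many tiny cliques
q_single_small, q_pairs_small = ring_of_cliques(4, 5)   # few larger cliques
```

For 30 triangles, merging adjacent cliques raises Q from about 0.717 to about 0.808, even though each triangle is an obvious community. For 4 cliques of 5 vertices the natural partition wins (about 0.659 versus 0.455), so the effect depends on how small the communities are relative to the whole graph.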

The resolution limit problem directly depends on the definition of modularity (as reported in Equation (1)) and, in
particular, on the null model exploited for defining Q. In the definition of the null model we assume that an arbitrary
pair of vertices in the graph can belong to the same community, independently of the position of each vertex in the
graph or, in other words, that the horizon of an arbitrary vertex coincides with the whole graph. Such an assumption
can hold true for small graphs but it is certainly false for large-scale networks like the Facebook social graph.
A possible strategy to address the resolution limit would therefore be to assume that each vertex has a partial
horizon and, consequently, is allowed to interact with just a portion of the graph. This leads to local
measures of modularity [35].

Li et al. [32] introduced a function, called modularity density, which is not based on a null model. The usage of
modularity density is proven to yield better results than those achieved by the modularity defined in Equation (1),
although it still suffers from the resolution limit problem. To avoid the resolution limit, the authors suggested a
more general definition of modularity density which depends on a parameter λ. The tuning of λ allows for exploring
the graph at various levels and, depending on the specific value of λ, it can favor the discovery of either small
or large communities. An analogous study is presented in [2].

Berry et al. [3] studied how to alleviate the resolution limit problem in the context of weighted graphs. In that paper,
the authors suggested assigning a weight equal to 1 to each intra-cluster edge and a weight equal to ε to each
inter-cluster edge, with 0 < ε < 1. With this weighting procedure, the authors proved that a cluster is not detected
if the following condition occurs: $W_i < \sqrt{W\varepsilon/2} - \varepsilon$, where $W_i$ is the sum of the weights of intra-cluster edges, W the sum
of all weights of the edges in G and ε the maximum weight of an inter-cluster edge. This bound is better than the
resolution limit found in [18] for unweighted graphs, which means that by suitably weighting a graph it is possible
to discover small-sized clusters while still optimizing modularity.



3. The κ-path edge centrality

In this section we briefly describe our method to rank edges in a graph, called κ-path edge centrality. The κ-path edge
centrality was introduced in [12] to better understand the strength of the tie bonding two vertices and, as a further
step, to produce better graph clusterings [11].

The notion of κ-path centrality relies on the idea of ranking edges in a graph according to their capability of spreading
information. In our approach, the basic piece of information is called a message. The introduction of κ-path edge
centrality is justified by our intuitive definition of community: we assume that communities in social networks
can be thought of as circles of friends who frequently exchange messages with each other. Ranking edges on the basis of
their aptitude to spread messages is therefore instrumental to highlighting preferential pathways along which information
flows and, ultimately, this is relevant to disclosing communities.

To define the κ-path edge centrality we need a mathematical model describing how information flows in a network.
To the best of our knowledge, one of the first models to rank edges in a graph on the basis of how information spreads
and, subsequently, to discover communities was introduced by Fortunato et al. in [19]. Such a model assumes that
information from a source vertex to a target one is forced to travel along the shortest path connecting the two vertices.
Therefore, given a pair of vertices i and j, Fortunato et al. defined a parameter, called efficiency, as the inverse of the
length of the shortest path connecting i and j.
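
For an unweighted graph, the efficiency of a pair of vertices can be obtained with a single breadth-first search, as in the minimal sketch below. The function name and the convention of returning 0.0 for unreachable pairs are assumptions made for this example.

```python
from collections import deque


def efficiency(adj, i, j):
    """Inverse of the hop length of the shortest path between distinct vertices i and j.

    adj maps each vertex to an iterable of its neighbors.
    Returns 0.0 when j cannot be reached from i (a convention assumed here).
    """
    if i == j:
        raise ValueError("efficiency is defined for pairs of distinct vertices")
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        if u == j:
            return 1.0 / dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return 0.0


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "z": []}
eff_ac = efficiency(graph, "a", "c")  # shortest path a-b-c has 2 hops, so 0.5
```

Computing this quantity for all pairs requires one search per source vertex, which is where the cost on large networks comes from.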

The hypotheses governing the model of Fortunato et al. may not hold true in real scenarios: for instance, in online
social networks like Facebook, a user is agnostic about the whole network topology and, therefore, she is not able to
find shortest paths. In addition, the computation of efficiency requires calculating all-pairs shortest paths and this
task can be prohibitively time-expensive in large networks.

To solve these drawbacks, we drop the assumption that information is forced to flow along shortest paths and assume 
that all paths in the graph can be exploited to convey information. As a consequence, since all paths are eligible to 
convey messages, we assume that a sender selects at random one of these paths to transmit the original message. 

We pose two further requirements: 

1. Simple Paths. We must avoid that in the information propagation process an edge is selected more than once.
This encodes the fact that, in reality, whenever a user wants to disseminate a message, she has to maximize its
coverage (i.e., she wants the message to be delivered to as many users as possible). Therefore, she must avoid
sending the message twice to the same user.

2. Bounded Length Paths. As shown in [20], distant vertices in social networks (i.e., those vertices that are
connected by long paths only) are unlikely to influence each other. We agree with this observation and assume that
two vertices are classified as distant if the path connecting them is longer than κ hops, κ being a fixed threshold.

A path satisfying the requirements above is called a simple κ-path. We are now in a position to formally introduce the
concept of κ-path centrality:

Definition 1. (κ-path edge centrality.) Let: (i) $G = (V, E)$ be a graph, (ii) $\kappa > 0$ be an integer and (iii) $e_{ij} \in E$ be an
edge of E connecting the vertices i and j. The κ-path edge centrality $L^{\kappa}(e_{ij})$ of $e_{ij}$ is the sum, over all possible source
vertices s, of the probability with which a message originated from s traverses $e_{ij}$, assuming that the message traversals
are only along random simple κ-paths.

If we define as $\Pr(e, s)$ the probability of selecting the edge e in a random simple κ-path originating from an arbitrary
source vertex s, the centrality of an edge $e_{ij}$ reads as follows

$$L^{\kappa}(e_{ij}) = \sum_{s \in V} \Pr(e_{ij}, s)$$

Unfortunately, Definition 1 is hard to apply in practice because we need to consider all potential random simple paths
of length at most κ in G and their number could be exponential in the number of vertices in G. To overcome
this limitation, we decided to use multiple random walks to simulate the propagation of a message. Such an idea has
been already successfully exploited to compute the centrality of vertices in graphs ifTl lSSl . 

The usage of random walks to simulate simple κ-paths allowed us to design a heuristic algorithm to efficiently
approximate edge centrality. Our algorithm is called ERW-Kpath (Edge Random Walk κ-path Centrality). It takes as
input a graph $G = (V, E)$, an integer κ and an integer p. The algorithm performs p iterations. We study the impact of
p on the algorithm's performance in Theorem 3.2.



At each iteration, the ERW-Kpath algorithm works as follows: 

1. A vertex $i \in V$ is selected as the starting vertex uniformly at random among all vertices in V. A message is injected
in the graph starting from i. All the edges in E are marked as unvisited. A weight equal to 1 is assigned to
each edge.

2. The message is propagated in the network as follows: starting from i, a vertex j adjacent to i is selected at
random with uniform probability among those for which the edge $e_{ij}$ linking i and j is marked as unvisited. If there
is no such edge, Step 2 ends. Otherwise, the weight of $e_{ij}$ is increased by 1, $e_{ij}$ is marked as
visited and the propagation process restarts from j. The process ends when one of the following two criteria is
met: (i) a path of length κ has been generated or (ii) there is no edge incident to j that is still marked
as unvisited.



The label associated with each edge prevents from visiting an edge more than once. When the ERW-Kpath algorithm 
ends, the weight of each edge e,j is divided by p and the obtained value is the edge centrality L*(e,/) of e,j returned by 
the ERW-Kpath algorithm. 
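
To make the procedure concrete, the iteration described above can be sketched in Python as follows. This is a minimal illustration and not the reference implementation: the function name `erw_kpath`, the adjacency-dictionary input and the `seed` argument are assumptions of the sketch, and the initial weight of 1 is assigned once, before the first iteration, so that the increments accumulate over the $p$ walks.

```python
import random


def erw_kpath(adj, kappa, p, seed=None):
    """Estimate kappa-path edge centrality with p random simple walks.

    adj: dict mapping each vertex to a list of its neighbours (undirected graph).
    Returns a dict {frozenset({u, v}): estimated centrality}.
    """
    rng = random.Random(seed)
    vertices = list(adj)
    # Assumption of this sketch: weights are initialised to 1 only once.
    weight = {frozenset((u, v)): 1.0 for u in adj for v in adj[u]}

    for _ in range(p):
        current = rng.choice(vertices)   # starting vertex, uniform over V
        visited = set()                  # all edges start as unvisited
        for _ in range(kappa):           # a walk has at most kappa edges
            candidates = [v for v in adj[current]
                          if frozenset((current, v)) not in visited]
            if not candidates:           # no unvisited edge: the walk stops
                break
            nxt = rng.choice(candidates)  # uniform choice among unvisited edges
            edge = frozenset((current, nxt))
            weight[edge] += 1.0
            visited.add(edge)
            current = nxt

    # Final normalisation: divide every weight by the number of iterations.
    return {edge: w / p for edge, w in weight.items()}
```

For instance, `erw_kpath({0: [1], 1: [0, 2], 2: [1]}, kappa=2, p=100, seed=0)` returns one value for each of the two edges of a three-vertex path; the `visited` set plays the role of the edge labels mentioned above.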

The worst-case time complexity of the ERW-Kpath algorithm is $O(\kappa p)$ because the outer loop is iterated exactly
$p$ times whereas the inner one is carried out at most $\kappa$ times. It is possible to show that the ERW-Kpath algorithm
provides an accurate approximation of the actual centrality value of an edge. To simplify the presentation of our results,
we shall use a lighter notation in which an edge connecting two arbitrary vertices $i$ and $j$ is denoted
by $e$ rather than $e_{ij}$ (as we did before).



Before stating Theorem 3.2 we need to define the deviation error associated with an edge $e \in E$.

The deviation error is defined as follows

$$\epsilon_e = \frac{|\tilde{L}^{\kappa}(e) - L^{\kappa}(e)|}{L^{\kappa}(e)}$$

and it assesses the (percentage) deviation of $\tilde{L}^{\kappa}(e)$ from $L^{\kappa}(e)$. The smaller $\epsilon_e$, the better the approximation of $L^{\kappa}(e)$.

Our goal is to show that, for an arbitrary edge $e$ and an arbitrary constant $\epsilon$, the probability that the deviation error is
larger than $\epsilon$ is bounded by a function that decays exponentially in $p$. Such a result states that a small increase in $p$ (and, therefore,
in the running time of the ERW-Kpath algorithm) produces a substantial decrease in the probability that $\epsilon_e$ exceeds
$\epsilon$, and this ultimately implies that the algorithm quickly converges to the actual values of edge centrality.

To prove such a property we need a preliminary result known as Hoeffding's inequality:

Theorem 3.1. (Hoeffding's inequality) Let $X_1, \ldots, X_n$ be independent random variables. Assume that, for each $i$ such
that $1 \le i \le n$, the random variable $X_i$ ranges in the real interval $[\alpha_i, \beta_i]$. Let $\bar{X} = (X_1 + \cdots + X_n)/n$. For any $t > 0$ we
have

$$\Pr\left(|\bar{X} - E[\bar{X}]| \ge t\right) \le 2\exp\left(-\frac{2t^2 n^2}{\sum_{i=1}^{n}(\beta_i - \alpha_i)^2}\right). \qquad (3)$$



Proof. See 1221 



□ 



As a special case, if all random variables $X_i$ can only take the value 0 or 1, then Equation (3) simplifies to

$$\Pr\left(|\bar{X} - E[\bar{X}]| \ge t\right) \le 2\exp\left(-2t^2 n\right). \qquad (4)$$

We are now able to prove our claims. 

Theorem 3.2. Let: (i) $G = (V, E)$ be a graph, (ii) $(\kappa, p)$ be a pair of positive integers and (iii) $\epsilon$ be a positive real number.
Assume that the ERW-Kpath algorithm is run on $G$ so that it generates $p$ simple paths of length up to $\kappa$, and let
$\tilde{L}^{\kappa}(e)$ be the edge centrality value of an arbitrary edge $e \in E$ returned by the ERW-Kpath algorithm. The probability
that the deviation error is greater than or equal to $\epsilon$ is bounded by a function that decays as $e^{-p}$, up to constants that do not depend on $p$.

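Proof. By Definition 1 and the definition of conditional probability, we get

$$L^{\kappa}(e) = \sum_{s \in V} \Pr(e, s) = \sum_{s \in V} \Pr(e \mid s)\Pr(s) = \frac{1}{|V|}\sum_{s \in V} \Pr(e \mid s). \qquad (5)$$

In Equation (5) the term $\Pr(s)$ is the probability that $s$ is selected as the source vertex and it equals $1/|V|$ because the
starting vertex is selected with uniform probability across all vertices in $V$.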

The ERW-Kpath algorithm performs $p$ iterations and, at each iteration, it generates a simple $\kappa$-path. In our analysis
we first focus on the result produced in a given iteration, say the $\ell$-th iteration with $1 \le \ell \le p$.

Let us define the random variable $Y_{\ell}(e)$ as follows

$$Y_{\ell}(e) = \begin{cases} 1 & \text{if } e \text{ has been selected at the } \ell\text{-th iteration} \\ 0 & \text{otherwise.} \end{cases}$$

Since $Y_{\ell}(e)$ is an indicator variable, we have that $E[Y_{\ell}(e)] = \Pr(Y_{\ell}(e) = 1)$.

For a fixed $\ell$, $\Pr(Y_{\ell}(e) = 1)$ is equal to the probability of selecting $e$ independently of the choice of the starting vertex $s$.
Therefore, if we denote, as before, by $\Pr(e, s)$ the probability that the ERW-Kpath algorithm selects $e$ when the starting
vertex is $s$, we get

$$\Pr(Y_{\ell}(e) = 1) = \sum_{s \in V} \Pr(e, s) = \sum_{s \in V} \Pr(e \mid s)\Pr(s) = \frac{1}{|V|}\sum_{s \in V} \Pr(e \mid s).$$

The ERW-Kpath algorithm increases by 1 the weight of $e$ each time $e$ is selected and, at the end, it divides the total
weight by $p$. It follows that the value of centrality $\tilde{L}^{\kappa}(e)$ returned by the ERW-Kpath algorithm is a random variable
distributed as $\frac{1}{p}\sum_{\ell=1}^{p} Y_{\ell}(e)$. The expected value of $\tilde{L}^{\kappa}(e)$ is as follows



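$$E[\tilde{L}^{\kappa}(e)] = \frac{1}{p}\sum_{\ell=1}^{p} E[Y_{\ell}(e)] = \frac{1}{p} \times p \times L^{\kappa}(e) = L^{\kappa}(e).$$

We can now apply Equation (4), in which the variables $Y_1(e), \ldots, Y_p(e)$ play the role of $X_1, \ldots, X_n$ with $n = p$. From Equation (4) with $t = \epsilon L^{\kappa}(e)$ it follows that

$$\Pr\left(|\tilde{L}^{\kappa}(e) - L^{\kappa}(e)| \ge \epsilon L^{\kappa}(e)\right) \le 2\exp\left(-2p\epsilon^2 L^{\kappa}(e)^2\right).$$

Since $L^{\kappa}(e) > 0$, the previous inequality can be rewritten as follows

$$\Pr\left(\frac{|\tilde{L}^{\kappa}(e) - L^{\kappa}(e)|}{L^{\kappa}(e)} \ge \epsilon\right) \le 2\exp\left(-2p\epsilon^2 L^{\kappa}(e)^2\right) \;\Longrightarrow\; \Pr(\epsilon_e \ge \epsilon) \le 2\exp(-Kp)$$

where $K = 2\epsilon^2 L^{\kappa}(e)^2$ does not depend on $p$. The last inequality states that the probability that the deviation error associated with an
arbitrary edge $e$ is greater than or equal to $\epsilon$ decays exponentially in $p$, and this ends the proof.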



□ 



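According to Theorem 3.2 we can conveniently fix $p$ so as to make $\Pr(\epsilon_e \ge \epsilon)$, for an arbitrary threshold
$\epsilon$, as small as we need. In particular, for a fixed $\epsilon$, we can set $p = \frac{\alpha \log |V|}{2\epsilon^2 L^{\kappa}(e)^2}$, where $\alpha$ is any real number greater than 1, to
obtain the following result

$$\Pr\left(\frac{|\tilde{L}^{\kappa}(e) - L^{\kappa}(e)|}{L^{\kappa}(e)} \ge \epsilon\right) \le 2\exp\left(-2p\epsilon^2 L^{\kappa}(e)^2\right) = 2\exp(-\alpha \log |V|) = \frac{2}{|V|^{\alpha}}.$$

In such a case, the worst-case time complexity of the ERW-Kpath algorithm is $O(\kappa \alpha \log |V|)$ and, for each edge $e \in E$,
its deviation error exceeds $\epsilon$ with probability at most $2/|V|^{\alpha}$.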
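
As a numerical illustration of this choice of $p$, the following Python sketch evaluates the two-sided Hoeffding bound of Equation (4) and the number of iterations that drives it below $2/|V|^{\alpha}$. The function names are assumptions of the sketch; moreover, the true centrality $L^{\kappa}(e)$ is not known in practice, so the value passed as `centrality` has to be understood as a guess or a lower bound.

```python
import math


def hoeffding_bound(p, eps, centrality):
    """Bound 2*exp(-2*p*eps^2*L^2) on Pr(deviation error >= eps), capped at 1."""
    return min(1.0, 2.0 * math.exp(-2.0 * p * eps**2 * centrality**2))


def iterations_for(alpha, n_vertices, eps, centrality):
    """Smallest integer p such that 2*p*eps^2*L^2 >= alpha*log|V|."""
    return math.ceil(alpha * math.log(n_vertices) / (2.0 * eps**2 * centrality**2))


# Hypothetical setting: |V| = 1000, alpha = 2, eps = 0.1, L = 0.5
p = iterations_for(alpha=2, n_vertices=1000, eps=0.1, centrality=0.5)
bound = hoeffding_bound(p, eps=0.1, centrality=0.5)
```

Because $p$ grows only logarithmically with $|V|$ for fixed $\epsilon$ and $L^{\kappa}(e)$, doubling the exponent $\alpha$ merely doubles the number of walks.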



4. Computing Distances and Finding Communities 

In this section we describe how to use $\kappa$-path edge centrality to compute distances between vertices in a graph (see
Section 4.1); after this, we show how to use these distances in conjunction with the Louvain Method to identify
communities in a graph (see Section 4.2). Finally, we compare our method to a measure of centrality that turns out to
be very close to ours (see Section 4.3).

4.1. Computing Distances between Graph Vertices 

Once $\kappa$-path centrality values have been computed, CONCLUDE calculates the distances between each
pair of vertices; computing distances allows graph vertices to be embedded into a metric space. Such a feature is
interesting per se, independently of the fact that we use the embedding to perform graph clustering.

In fact, in many Database and Information Retrieval applications, a large (and perhaps one of the most common)
class of queries assumes that a collection of data items is available and asks for the items in the collection that best
match a particular query item, according to some definition of best. For example, given a database of
images, a typical query is to find all images that are similar to a particular query image. If a metric distance can
be defined between pairs of items in the database, then it is possible to design efficient distance-based data
structures which make the retrieval or the indexing of that items easy and fast even on huge collections Q. 

Our approach to computing distances relies on the computation of $\kappa$-path edge centralities and this provides two main
advantages. First, the computation of distances is fully automatic and does not require any human intervention.
Second, our definition of distance is based on the information propagation model discussed in Section 3. This gives
our definition of distance a meaningful interpretation: the lower the chance that two vertices exchange messages,
the more distant they are. In particular, two vertices $i$ and $j$ are deemed close if a message received by $i$ is sent, with
high probability, to the vertex $j$ and vice versa. To formally encode such reasoning, we could first define a proximity
measure $\sigma_{ij}$ between the vertices $i$ and $j$ as the $\kappa$-path centrality of the edge linking $i$ and $j$. Unfortunately, such a
measure of proximity is not satisfactory for two reasons. First, there may be no edge between vertices $i$ and $j$ and,
in such a case, the proximity $\sigma_{ij}$ would be undefined. Second, even assuming the existence of an edge from $i$ to $j$ and an edge
from $j$ to $i$, the centrality of the edge going from $i$ to $j$ may differ from the centrality of the edge going from
$j$ to $i$; this implies that $\sigma_{ij} \ne \sigma_{ji}$, thus violating the symmetry requirement imposed by the concept of proximity.

To overcome these limitations, we rely on the principle of structural similarity [41] and we assume that two vertices
$i$ and $j$ are to be considered close if their neighbors are close too. In line with the previous reasoning, a vertex
$k$ is close to both $i$ and $j$ if the probability that a message flows from $k$ to $i$ is comparable to the probability that it
flows from $k$ to $j$. The same definition implies that the vertices $i$ and $j$ are classified as distant if the probability that a
message flows from $k$ to $i$ is high (resp., low) and the probability that a message flows from $k$ to $j$ is low (resp., high).
The probability of conveying a message from $k$ to $i$ (resp., $j$) coincides with the $\kappa$-path edge centrality of the edge $e_{ki}$
(resp., $e_{kj}$). Such a definition is interesting because the vertices $i$, $j$ and $k$ would form a triangle in which, if
one of the three vertices receives a message, then such a message is conveyed with high probability to the other two
vertices; analogously, if one of these vertices does not receive a message, the chance that the other two vertices
will receive it is low.

By generalizing the previous concept, we can consider a group $C$ of vertices in $G$ and observe that, if a message
flows with high probability among the vertices in $C$, then $C$ is mapped onto a dense region, which indicates that $C$ is
a community in $G$. This argument leads us to consider the following definition of proximity

$$\sigma_{ij} = \sqrt{\sum_{k \in V} \left[L^{\kappa}(e_{ik}) - L^{\kappa}(e_{kj})\right]^2} \qquad (6)$$

Equation (6) has some useful properties: it uses the contribution of all vertices in $V$ to quantify the proximity of
vertices $i$ and $j$; in addition, it is based on the traditional Euclidean distance and, therefore, it maps the vertices
of $G$ onto points of a Euclidean space.

In Equation (6) we assume that all vertices are equally important in deciding whether two vertices are close, regardless
of their degrees. Such a choice, however, can yield unwanted results because the degree of a vertex can influence the
centrality of the edges incident on it.

To see this, let us select at random a vertex $s \in V$ and generate a random simple
$\kappa$-path $p_s = \{s, i_1, i_2, \ldots, i_{\kappa-1}\}$ starting from $s$. Let us focus on a single run of the ERW-Kpath algorithm and observe
that the probability of selecting the edge connecting $s$ and $i_1$ is equal to $\frac{1}{|N(s)|}$, where $N(s)$ is the set of neighbors
of the vertex $s$. Analogously, the probability of going from $i_1$ to $i_2$ is equal to $\frac{1}{|N(i_1) - \{s\}|}$. The previous formula is
justified by the fact that the path generated in a single run by the ERW-Kpath algorithm cannot traverse the
same edge twice: therefore, if $i_2$ coincided with $s$, the edge connecting $i_2$ with $i_1$ could not be selected. In such a case the
vertex $i_2$ must be excluded from $N(i_1)$.

By generalizing the previous result, if we consider a path $p_s = \{s, i_1, \ldots, i_{\ell}\}$ and we focus on two vertices $i_a$ and
$i_{a+1}$, the probability of selecting the edge joining $i_a$ and $i_{a+1}$ is

$$\Pr(e_{i_a i_{a+1}}) = \begin{cases} 0 & \text{if } N(i_a) \subseteq \{s, i_1, \ldots, i_{a-1}\} \\ \dfrac{1}{|N(i_a) - \{s, i_1, \ldots, i_{a-1}\}|} & \text{otherwise.} \end{cases}$$

Intuitively, the special case $\Pr(e_{i_a i_{a+1}}) = 0$ occurs when all edges incident on $i_a$ have already been visited during
the considered path $\{s, i_1, \ldots, i_{a-1}\}$. We denote by $\deg(i_a)$ the degree of $i_a$ and observe that $\deg(i_a) = |N(i_a)| \ge
|N(i_a) - \{s, i_1, \ldots, i_{a-1}\}| \ge |N(i_a)| - \kappa$. In practical scenarios, the value of $\kappa$ is small and, therefore, $\Pr(e_{i_a i_{a+1}})$
is approximately proportional to $\frac{1}{\deg(i_a)}$.

According to these observations, the probability of selecting an edge is biased by the degree of the vertex associated
with it. Therefore, the larger the degree of a vertex $i$, the more likely it is that a message passes through $i$ in several simulations,
and $i$ would be classified as close to almost all vertices.



This suggests alleviating the dependency of the proximity calculation on the degrees of the vertices involved.
To do so, we normalize the proximity as follows

$$\sigma_{ij} = \sqrt{\sum_{k \in V} \frac{\left[L^{\kappa}(e_{ik}) - L^{\kappa}(e_{kj})\right]^2}{d(k)}} \qquad (7)$$

where $d(k)$ is the degree of the vertex $k$, and, subsequently, we define the distance between vertices as $d_{ij} = 1 - \sigma_{ij}$.

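The naive implementation of the distance computation could be time-expensive as it requires iterating over all pairs of vertices. Yet
its cost can be decreased to nearly linear by observing that we can rewrite Equation (7) as

$$\sigma_{ij} = \sqrt{\sum_{k \in N(i) - CN(i,j)} \frac{L^{\kappa}(e_{ik})^2}{d(k)} + \sum_{k \in N(j) - CN(i,j)} \frac{L^{\kappa}(e_{kj})^2}{d(k)} + \sum_{k \in CN(i,j)} \frac{\left[L^{\kappa}(e_{ik}) - L^{\kappa}(e_{kj})\right]^2}{d(k)}} \qquad (8)$$

where the symbol $CN(i, j)$ indicates the subset of neighbors common to $i$ and $j$ and $d(k)$ is the degree of the vertex $k$. Vertices adjacent to neither $i$ nor $j$ do not appear because both of their centrality terms are zero.

With this rewriting, the cost of the computation is reduced to $O(d(v)^2 |V|)$, where $d(v)$ is the average degree
of the vertices of the network.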
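
The neighborhood-restricted computation can be sketched in Python as follows. The sketch rests on two assumptions that should be checked against the definitions above: each squared difference is divided by the degree of the intermediate vertex $k$, and an edge that does not exist has centrality 0. The function names and the dictionary-based inputs are illustrative only.

```python
import math


def proximity(adj, centrality, i, j):
    """Degree-normalised proximity between vertices i and j.

    adj: dict vertex -> list of neighbours (undirected graph).
    centrality: dict frozenset({u, v}) -> edge centrality.
    Only vertices adjacent to i or j are visited: for every other vertex k
    both centrality terms are 0, so its contribution vanishes.
    """
    def L(u, v):
        return centrality.get(frozenset((u, v)), 0.0)  # absent edge -> 0

    total = 0.0
    for k in set(adj[i]) | set(adj[j]):
        total += (L(i, k) - L(k, j)) ** 2 / len(adj[k])
    return math.sqrt(total)


def distance(adj, centrality, i, j):
    """Distance defined as 1 minus the proximity."""
    return 1.0 - proximity(adj, centrality, i, j)
```

Since each evaluation touches only the neighbors of $i$ and $j$, computing the distance for every pair of adjacent vertices costs a number of operations that depends on the vertex degrees rather than on $|V|$.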



4.2. Finding Communities 

The last step of our method is the partitioning of the network. CONCLUDE adopts the paradigm of network
modularity maximization, which we described in detail in Section 2, and exploits an approximate technique inspired
by the Louvain method (LM) [4].

LM is a computationally efficient algorithm and is therefore well suited to partitioning large networks. It consists of two
steps which are iteratively repeated. The input of LM is a weighted network $G = (V, E, W)$, where $W$ denotes the weight
associated with each edge^. The modularity is defined as in Equation (1), in which the term associated with the pair $(i, j)$ is the weight of the edge
linking $i$ and $j$, and $k_i$ (resp., $k_j$) is the sum of the weights of the edges incident on $i$ (resp., $j$).

Initially, each vertex $i$ forms a community of its own and, therefore, there are as many communities as vertices in $V$.
The two steps of LM operate as follows:

(i) For each vertex $i$, the algorithm computes the gain $\Delta Q$ derived from moving $i$ to a cluster $C$, as

$$\Delta Q = \left[\frac{\Sigma_C + k_i^C}{2m} - \left(\frac{\hat{\Sigma}_C + k_i}{2m}\right)^2\right] - \left[\frac{\Sigma_C}{2m} - \left(\frac{\hat{\Sigma}_C}{2m}\right)^2 - \left(\frac{k_i}{2m}\right)^2\right] \qquad (9)$$

Here $\Sigma_C$ is the sum of the weights of the edges inside $C$, $\hat{\Sigma}_C$ is the sum of the weights of the edges incident on
vertices in $C$, $k_i$ is the sum of the weights of the edges incident on the vertex $i$, $k_i^C$ is the sum of the weights of the
edges from $i$ to vertices located inside $C$ and $m$ is the sum of the weights of all the edges in the network.

The vertex $i$ is placed in the cluster $C$ for which the gain achieves its maximum value. If it is not possible to achieve
a positive gain, the vertex $i$ remains in its original community. This process is applied repeatedly and sequentially
to all the vertices until no further improvement can be achieved.



^Of course, in the case of unweighted graphs, W is the adjacency matrix of G.

(ii) In the second step, the algorithm builds a meta-network whose vertices are the clusters found in Step (i),
collapsing all edges among vertices belonging to a pair of clusters onto a single edge. The weight of this edge is
equal to the sum of the weights of the collapsed edges. Once the second step has been performed, the algorithm
re-applies the first step.
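
The two steps can be illustrated with a small Python sketch. It is not the optimized procedure used by LM: the gain of a move is obtained by brute force, as the difference between the modularity after and before the move, and the modularity used here is the standard one for undirected weighted graphs, which we assume coincides with Equation (1). All function names and the edge-dictionary input are assumptions of the sketch.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity of a partition of an undirected weighted graph.

    edges: dict {(u, v): weight}, each undirected edge listed once;
           a self-loop (u, u) stands for edges internal to a meta-vertex.
    community: dict {vertex: community id}.
    """
    m = sum(edges.values())
    degree = defaultdict(float)
    internal = defaultdict(float)
    for (u, v), w in edges.items():
        degree[u] += w
        degree[v] += w            # a self-loop therefore counts twice
        if community[u] == community[v]:
            internal[community[u]] += w
    total = defaultdict(float)
    for v, k in degree.items():
        total[community[v]] += k
    return sum(internal[c] / m - (total[c] / (2 * m)) ** 2 for c in total)


def move_gain(edges, community, vertex, target):
    """Step (i): change in modularity when `vertex` joins `target`."""
    moved = dict(community)
    moved[vertex] = target
    return modularity(edges, moved) - modularity(edges, community)


def aggregate(edges, community):
    """Step (ii): collapse every community into a single meta-vertex."""
    meta = defaultdict(float)
    for (u, v), w in edges.items():
        cu, cv = community[u], community[v]
        key = (cu, cv) if cu <= cv else (cv, cu)
        meta[key] += w            # weights of collapsed edges are summed
    return dict(meta)
```

With this convention for self-loops, the modularity of the meta-network in which every meta-vertex is its own community equals the modularity of the partition that generated it, so the first step can be re-applied to the meta-network without losing information.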

Steps (i) and (ii) are repeated until the improvement $\Delta Q$ attained in an iteration becomes arbitrarily small. The cost of the
whole process is $O(\gamma |V|)$, where $\gamma$ is the number of iterations required by the algorithm to converge (in our experience,
$\gamma < 5$).

The advantage of our approach with respect to the original LM is twofold. First, we obtain the splitting of clusters
connected by edges with low distance, which is a global feature, while maximizing the network modularity, whereas LM
relies only on local information (i.e., vertex neighborhoods). Second, our strategy is able to produce an edge weighting,
while the original LM, like most current clustering algorithms, cannot infer edge weights in unweighted
networks. This aspect ensures better performance of our strategy in most cases (as discussed in the remainder of
this article).

In summary, by adopting efficient graph memoization techniques, the computational cost of CONCLUDE is nearly
linear. In fact, it results from the three previously described steps, i.e., $O(\kappa |E| + d(v)^2 |V| + \gamma |V|) = O(\gamma |E|)$.

The pseudo-code describing the steps performed by CONCLUDE is reported in Algorithm 1. In detail,
CONCLUDE takes as input the graph $G$, an integer $\kappa$ and an integer $p$. It first calls a subroutine ERW-Kpath that
computes the $\kappa$-path centralities of the edges of $G$ by performing at most $p$ iterations. The output of this subroutine
is stored in an array of weights $\omega$. In the second step CONCLUDE calls a subroutine Compute-Pairwise-Distances,
which takes as input the graph $G$ along with the array of weights $\omega$ and computes distances among pairs of vertices
by applying Equation (8). The subroutine Compute-Pairwise-Distances returns as output a matrix $A$ containing the
distances between pairs of vertices. In its final step, CONCLUDE calls the subroutine Louvain-Method, which
implements LM. It takes as input the graph $G$ and the matrix $A$ and returns as output the community structure $C$ of $G$.



Algorithm 1 CONCLUDE($G = (V, E)$: a graph, $\kappa$: an integer, $p$: an integer)
$\omega \leftarrow$ ERW-Kpath$(G, \kappa, p)$
$A \leftarrow$ Compute-Pairwise-Distances$(G, \omega)$
$C \leftarrow$ Louvain-Method$(G, A)$



4.3. Comparison with existing definitions of distance

In the past, many papers have focused on the problem of computing distances between the vertices of a graph and,
subsequently, on using such distances to cluster the graph. A notable approach is due to Donetti and
Muñoz [13]. In that paper, the authors suggested considering the Laplacian $L(G)$ associated with a graph $G$, which
is defined as $L(G) = D_G - A_G$. Here $D_G$ is a diagonal matrix such that $D_G[i, i]$ is equal to the degree of the $i$-th
vertex and $A_G$ is the adjacency matrix of $G$. The graph $G$ is assumed undirected and, therefore, $A_G$ is a symmetric
matrix. The approach of [13] computes the first $D$ non-trivial eigenvectors of $L(G)$, i.e., the eigenvectors
corresponding to the $D$ largest and non-zero eigenvalues of $L(G)$. Subsequently, graph vertices are projected onto
the subspace spanned by these eigenvectors. In this way each vertex is represented by a point in $\mathbb{R}^D$ and the distance
between two vertices is defined as the distance between the two corresponding vectors in $\mathbb{R}^D$.

To make the comparison between our approach and that of Donetti and Muñoz [13] fair, we need to assume that $G$
is symmetric (i.e., undirected). Under this condition, if we relax the hypothesis of non-backtracking paths, we can give a spectral
interpretation of the $\kappa$-path edge centrality values computed by our ERW-Kpath algorithm. The ERW-Kpath algorithm
initially builds an $n \times n$ matrix $E_G$. The generic entry $E_G[i, j]$ is equal to 0 if there is no edge going from the vertex
$i$ to the vertex $j$ and it is equal to 1 otherwise. By construction, the matrix $E_G$ is symmetric and, therefore, there is
an orthonormal family of eigenvectors $e_1, \ldots, e_n$ associated with $E_G$ [34]. This means that $e_i^T e_j = \delta_{ij}$, where $\delta_{ij}$ is
the Kronecker delta. We can write $E_G$ as $E_G = \sum_{i=1}^{n} \lambda_i e_i e_i^T$, where $\lambda_i$ is the $i$-th eigenvalue. The
probability of going from the vertex $l$ to the vertex $m$ with a path of length 2 can be obtained by computing $E_G^2 = E_G E_G$
and by picking $E_G^2[l, m]$.

With some simple calculations we get

$$E_G^2 = \sum_{i=1}^{n} \lambda_i^2 e_i e_i^T$$

and, more generally, $E_G^p = \sum_{i=1}^{n} \lambda_i^p e_i e_i^T$. In these sums, the larger the magnitude of $\lambda_i$, the higher its contribution to $E_G^p$. As in
[13], we could pick only the $D$ largest eigenvalues to get an approximation of $E_G^p$. The probability that ERW-Kpath
selects an edge depends on the matrices $E_G^p$ (with $p \le \kappa$) and, therefore, the centrality of an edge is influenced by
the top $D$ non-trivial eigenvalues of $E_G$. According to Equation (8), the distances computed by our approach depend
on edge centralities and, therefore, the distances in our approach are also influenced by the $D$ eigenvalues
having the largest magnitude.
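
For completeness, the calculation behind the spectral form of the matrix powers can be written out as follows. It is a short derivation that uses only the orthonormality $e_i^T e_j = \delta_{ij}$ recalled above; the notation is the one of this section.

```latex
\begin{align*}
E_G^2 &= \Big(\sum_{i=1}^{n} \lambda_i e_i e_i^T\Big)\Big(\sum_{j=1}^{n} \lambda_j e_j e_j^T\Big)
       = \sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j \, e_i \,(e_i^T e_j)\, e_j^T \\
      &= \sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j \, \delta_{ij} \, e_i e_j^T
       = \sum_{i=1}^{n} \lambda_i^2 \, e_i e_i^T , \\
E_G^{p+1} &= E_G^{p} E_G
       = \Big(\sum_{i=1}^{n} \lambda_i^{p} e_i e_i^T\Big)\Big(\sum_{j=1}^{n} \lambda_j e_j e_j^T\Big)
       = \sum_{i=1}^{n} \lambda_i^{p+1} \, e_i e_i^T .
\end{align*}
```

The second line is the induction step: if the formula holds for the exponent $p$, the same cancellation of the cross terms gives it for $p + 1$.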

From a computational standpoint, [13] uses the popular Lanczos algorithm to find the eigenvalues of the Laplacian;
this algorithm, de facto, computes powers of a matrix by applying efficient procedures such as Gram-Schmidt orthogonalization.
The Lanczos algorithm is efficient on sparse matrices but, unfortunately, to the best of our knowledge, there is no
theoretical estimate of its worst-case complexity. Our approach, instead, takes $O(\kappa \alpha \log |V|)$ steps.



5. Experimental Results 



CONCLUDE has been tested in different fields of application, for example to analyze online social networks
and biological networks. In this article we discuss its application to the following two cases (see Section 5.1): (i)
artificially generated (henceforth, synthetic) networks with a pre-defined community structure; (ii) network datasets
from real-world applications. We also experimentally studied how well the ERW-Kpath algorithm ranks edges
according to their ability to spread messages in a network.



5.1. Cluster Detection



In order to evaluate the performance of our clustering method, we carried out two different types of experiments.
The former uses synthetic networks for which a pre-built community structure is well defined. The latter
considers different graphs describing the structure of some real-world networks.

The purpose of the former set of experiments is to assess the quality of the clustering produced by our algorithm in
a controlled environment in which the features of the networks taken into account are well known. In
particular, the pre-defined community structure is adopted as a ground truth to evaluate the quality of the partitions
obtained by our algorithm, and this is done by adopting a measure borrowed from information theory (called
normalized mutual information) to establish the accuracy of the partitions with respect to the ground truth.

The second set of experiments, instead, is carried out to evaluate the performance of our method against
three other state-of-the-art algorithms in real-world applications. This comparison is performed on network
datasets whose correct community structure is not known in advance; to assess the validity of the results we
use the internal quality measure of network modularity presented in Section 2.1. In fact, regardless of the
model adopted to find communities in a network, network modularity has been commonly adopted to establish
whether a community detection method has been able to discover clusters of vertices that are tightly interconnected with each other
and loosely interconnected with those belonging to other clusters. We assume that the higher the value of network
modularity achieved by an algorithm, the better the community structure of the network has been unveiled. Given the
limitations of network modularity (discussed in Section 2.2), this approach is only a best approximation of
a robust evaluation method. Indeed, the problem of evaluating the clustering quality on real-world networks lacking
a ground truth is an open and urgent problem in the current literature.

5.1.1. Synthetic networks

The first set of experiments has been carried out by using the LFR benchmark presented by Lancichinetti et al. [29].
In the LFR benchmark a user is required to provide the following information to generate a graph $G$: (i) Number
of vertices, denoted as $N$, and average vertex degree, denoted as $\langle k \rangle$. (ii) Power-law exponent of the vertex degree
distribution, denoted as $\gamma$. The vertex degree distribution in $G$ is shaped as $k^{-\gamma}$. (iii) Power-law exponent of the community
size distribution, denoted as $\beta$. The size of the communities in $G$ follows a power-law distribution with exponent equal to
$\beta$. The sum of the sizes of all the communities is constrained to be equal to $N$. (iv) Mixing parameter, denoted as $\mu$.
A user can specify a parameter $\mu \in (0, 1)$ such that each vertex in $G$ shares a fraction $\mu$ of its edges with vertices
outside its community and a fraction $1 - \mu$ with vertices residing in its community. The value $\mu = 0.5$ is the threshold beyond
which clusters are no longer defined in the strong sense (that is, each vertex has more neighbors in its own cluster
than in the others).

To generate our benchmark networks, we adopted the same configuration reported in [29]: (i) $N = 1000$; (ii) four
pairs of values for $\gamma$ and $\beta$, defined as follows: $(\gamma, \beta) = (2,1), (2,2), (3,1), (3,2)$; (iii) for each pair of exponents $\gamma$ and $\beta$,
three values of the average degree $\langle k \rangle = 15, 20, 25$; (iv) for each of the previous combinations, we generated six networks
by varying the mixing parameter $\mu = 0.1, \ldots, 0.6$.
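
The resulting grid of configurations can be enumerated with a few lines of Python. The sketch below only lists the parameter combinations; it does not generate the networks, and the dictionary keys are names chosen for illustration. It assumes that the mixing parameter varies in steps of 0.1.

```python
from itertools import product

N = 1000
EXPONENT_PAIRS = [(2, 1), (2, 2), (3, 1), (3, 2)]          # (gamma, beta)
AVG_DEGREES = [15, 20, 25]                                 # <k>
MIXING = [round(0.1 * step, 1) for step in range(1, 7)]    # mu = 0.1, ..., 0.6


def lfr_configurations():
    """Yield one parameter dictionary per benchmark network."""
    for (gamma, beta), avg_k, mu in product(EXPONENT_PAIRS, AVG_DEGREES, MIXING):
        yield {"N": N, "gamma": gamma, "beta": beta,
               "avg_degree": avg_k, "mu": mu}


configs = list(lfr_configurations())
```

Four exponent pairs, three average degrees and six values of the mixing parameter give 72 configurations in total.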

To assess the quality of the results, we adopted the measure called normalized mutual information (NMI) [9]. Such
a measure assumes that, given a graph $G$, a ground truth is available that specifies the clusters (called real clusters)
in $G$ and their features. Let us denote by $A$ the true community structure of $G$ and suppose that $G$ consists of
$c_A$ clusters. Let us consider a clustering algorithm applied to $G$ and assume that it identifies a community structure
$B$ consisting of $c_B$ clusters. We define a $c_A \times c_B$ matrix, called confusion matrix $CM$, such that each row of $CM$
corresponds to a cluster in $A$ whereas each column of $CM$ is associated with a cluster in $B$. The generic element
$CM_{ij}$ is equal to the number of elements of the real $i$-th cluster which are also present in the $j$-th cluster found by the
algorithm. Starting from this definition, the normalized mutual information is defined as

$$NMI(A, B) = \frac{-2\sum_{i=1}^{c_A}\sum_{j=1}^{c_B} CM_{ij} \log\left(\frac{CM_{ij}\, N}{N_{i\cdot} N_{\cdot j}}\right)}{\sum_{i=1}^{c_A} N_{i\cdot} \log\left(\frac{N_{i\cdot}}{N}\right) + \sum_{j=1}^{c_B} N_{\cdot j} \log\left(\frac{N_{\cdot j}}{N}\right)}$$

where $N_{i\cdot}$ (resp., $N_{\cdot j}$) is the sum of the elements in the $i$-th row (resp., $j$-th column) of the confusion matrix. If the
considered clustering algorithm worked perfectly then, for each discovered cluster $j$, there would exist a real cluster
$i$ exactly coinciding with $j$. In such a case, it is possible to show that $NMI(A, B)$ is exactly equal to 1 [9]. By contrast,
if the clusters detected by the algorithm are totally independent of the real communities, then it is possible to show
that the NMI is equal to 0. The NMI, therefore, ranges from 0 to 1 and the higher the value, the better the clustering
algorithm performs with respect to the ground truth.
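
The two limit cases can be checked numerically with a short Python sketch. It assumes the usual confusion-matrix form of the NMI, in which $N$ is the total number of elements in the matrix; the function name and the convention adopted when both partitions consist of a single cluster are choices of this sketch.

```python
import math


def normalized_mutual_information(cm):
    """NMI computed from a confusion matrix given as a list of rows of counts."""
    row = [sum(r) for r in cm]              # N_i. : row sums
    col = [sum(c) for c in zip(*cm)]        # N_.j : column sums
    n = sum(row)                            # total number of elements

    numerator = 0.0
    for i, r in enumerate(cm):
        for j, nij in enumerate(r):
            if nij > 0:                     # empty cells contribute nothing
                numerator += nij * math.log(nij * n / (row[i] * col[j]))

    denominator = sum(x * math.log(x / n) for x in row if x > 0)
    denominator += sum(x * math.log(x / n) for x in col if x > 0)
    if denominator == 0.0:
        return 1.0  # convention: both partitions have a single cluster
    return -2.0 * numerator / denominator


perfect = normalized_mutual_information([[5, 0], [0, 5]])
independent = normalized_mutual_information([[2, 2], [2, 2]])
```

In the first matrix every real cluster coincides with a discovered one, while in the second the discovered clusters carry no information about the real ones.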

The performance of CONCLUDE, reported in Table 1, shows excellent values of NMI on the LFR benchmark.
For each configuration, the partition provided by our algorithm is compared against the
ground truth built by the LFR benchmark, measuring the goodness of the partition according to the previously defined
normalized mutual information.

CONCLUDE provides in general high values of NMI for $\mu = 0.1, \ldots, 0.3$, which lead to the presence
of strongly defined clusters in the synthetic networks. Moreover, it is worth noting that the values of NMI are stable
across different configurations. With the mixing parameter $\mu$ fixed, let the parameters $\langle k \rangle$, $\gamma$ and $\beta$ vary; their variation
is reflected in the features of the networks generated by the benchmark. We observe, in this scenario, that our strategy
works well and consistently across different network structures and that the performance is independent of
particular network features (such as the degree distribution or the size of the clusters present in the network).

5.1.2. Real-world networks

In this section we discuss the results obtained by analyzing different graphs describing real-world network datasets.
Summary statistics of these networks are reported in Table 2.

Table 1: Values of normalized mutual information obtained by CONCLUDE when resolving the clusters in networks whose community structure
was artificially generated according to the LFR benchmark [29].

In particular, Datasets 1-5 represent the undirected networks of coauthors of articles that appeared in Arxiv^ as of April
2003 [31], in the fields of, respectively, General Relativity and Quantum Cosmology - CA-GrQc, High Energy Physics
(Theory) - CA-HepTh, High Energy Physics (Phenomenology) - CA-HepPh, Astro Physics - CA-AstroPh, and
Condensed Matter Physics - CA-CondMat.

Dataset 6 describes a small sample of the Facebook network, representing its directed friendship graph [45]. Finally,
Dataset 7 represents a large sample of the Facebook network collected by Gjoka et al. [22].

This experiment has been designed to quantitatively evaluate the performance of our strategy in real-world applications.
To configure ERW-Kpath, the values of $p$ and $\beta$ have been tuned as previously suggested. In addition, the
maximum length of the self-avoiding random walks has been set to $\kappa = 20$.

The results are measured by means of the network modularity (formally defined by Equation (1)) obtained
by CONCLUDE, compared against the values attained by three different techniques: (i) the already presented Louvain
method (LM), (ii) COPRA [25], which is a fast cluster detection algorithm based on the principle of label
propagation [42], and, finally, (iii) OSLOM [30], a local optimization algorithm able to find statistically significant
clusters.

Prior to presenting the results of our tests, we briefly describe the main features of COPRA and OSLOM.

COPRA (Community Overlap PRopagation Algorithm) relies on a label propagation strategy first proposed
by Raghavan et al. in [42]. The algorithm works in three stages: (i) Initially, each vertex $v$ is labeled with a
set of pairs $(c, b)$, where $c$ is a community identifier and $b$ (the belonging coefficient) indicates the strength of
the membership of $v$ in the community $c$; belonging coefficients are normalized so that the sum of all the belonging
coefficients associated with $v$ is equal to 1. Initially, the community associated with a vertex coincides with the vertex
itself and the belonging coefficient is 1. (ii) Then, repeatedly, $v$ updates its label so that the set of community identifiers
associated with $v$ is set equal to the union of the community identifiers associated with the neighbors of $v$; after that,



^Arxiv (http://arxiv.org/) is an online archive for scientific preprints in the fields of Mathematics, Physics and Computer Science, amongst
others.

the belonging coefficients are updated according to the following formula

$$b_i(c, v) = \frac{\sum_{w \in N(v)} b_{i-1}(c, w)}{|N(v)|}$$

where $N(v)$ is the set of neighbors of $v$ and $b_i(c, v)$ is the belonging coefficient associated with $v$ at the $i$-th iteration. At
each iteration, all the pairs in the label of $v$ having a belonging coefficient less than a threshold are filtered out; in such
a case the membership of $v$ in one of the deleted communities is considered not strong enough. It is possible that all
the pairs in a vertex label have a belonging coefficient less than the threshold. In such a case, COPRA retains only
the pair that has the greatest belonging coefficient and deletes all the others. Finally, if more than one pair has the
same maximum belonging coefficient, below the threshold, COPRA selects one of them at random and this makes the
algorithm non-deterministic. After deleting pairs from the vertex label, the belonging coefficients of the remaining
pairs are re-normalized so that they sum to 1. A stopping criterion ensures that COPRA ends after a finite number of steps.
In such a case, the set of community identifiers associated with $v$ identifies the communities to which $v$ belongs.
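
A single label update for one vertex can be sketched in Python as follows. The sketch follows the description above (averaging over the neighbors, filtering by a threshold, random tie-breaking, re-normalization); the function name, the dictionary-based label representation and the handling of isolated vertices are assumptions, and the stopping criterion is not shown.

```python
import random


def copra_update(v, adj, labels, threshold, rng=random):
    """Return the new label of v as a dict {community: belonging coefficient}.

    adj: dict vertex -> list of neighbours.
    labels: dict vertex -> {community: coefficient} from the previous iteration.
    """
    neighbours = adj[v]
    if not neighbours:                      # assumption: isolated vertices keep their label
        return dict(labels[v])

    # Average the neighbours' coefficients, community by community.
    new = {}
    for w in neighbours:
        for c, b in labels[w].items():
            new[c] = new.get(c, 0.0) + b / len(neighbours)

    # Drop the pairs whose coefficient is below the threshold.
    kept = {c: b for c, b in new.items() if b >= threshold}
    if not kept:
        # All pairs are below the threshold: keep one with the largest
        # coefficient, chosen at random among ties.
        best = max(new.values())
        candidates = sorted(c for c, b in new.items() if b == best)
        kept = {rng.choice(candidates): best}

    # Re-normalise so that the coefficients sum to 1.
    total = sum(kept.values())
    return {c: b / total for c, b in kept.items()}
```

The random choice among tied communities is the only source of non-determinism in the update; passing a seeded `random.Random` instance as `rng` makes a run reproducible.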

OSLOM (Order Statistics Local Optimization Method) is a multi-purpose technique that handles both directed
and undirected graphs. OSLOM is also able to detect overlapping communities and to build hierarchies of clusters.

The strategy of the algorithm to discover clusters in a graph G is as follows: at the beginning a vertex is selected at
random and forms, by itself, the first cluster C. After that, the q most statistically significant vertices in G are identified
and added to C. Here q is a random number and the significance of a vertex v is a parameter indicating the likelihood
that v can be inserted in C. To formally define the statistical significance, OSLOM considers a random null model,
i.e., a class of networks without community structure. A network G' in the random null model is generated by first
copying all the vertices of G into G'. After that, multiple pairs of vertices in G' are selected at random and an edge is
drawn between them. Due to this procedure, given a vertex v in G, there will exist a vertex v' in G' corresponding to
v. Analogously, given a subgraph C in G, there will be a subgraph C' in G' corresponding to C such that each vertex
in C' corresponds to a vertex in C. The null model is expected not to have a community structure and, therefore, it can
be used as a benchmark to understand whether a subgraph C in G is a community and to define the statistical significance of
a vertex v to C. In particular, we count the number l_1 of edges linking v with vertices in C; after that, we consider
the vertex v' corresponding to v in G' and we count the number l_2 of edges linking v' with vertices residing in C'. If
l_1 > l_2 we guess that v is significant to C (and can be included in it).
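
The comparison between the two link counts can be illustrated with a small Python sketch. It follows only the simplified description given above, not the order-statistics machinery of the actual OSLOM software. The helper names are hypothetical, and the choice of drawing as many random edges in the null model as there are edges in G is an assumption made here (the text only says that multiple pairs are selected).

```python
import random

def null_model(vertices, n_edges, rng):
    """Copy the vertices and draw n_edges edges between random distinct pairs."""
    vertices = list(vertices)
    max_edges = len(vertices) * (len(vertices) - 1) // 2
    if n_edges > max_edges:
        raise ValueError("too many edges requested for a simple graph")
    edges = set()
    while len(edges) < n_edges:
        u, w = rng.sample(vertices, 2)
        edges.add(frozenset((u, w)))  # duplicates are absorbed by the set
    return edges

def links_into(edges, v, cluster):
    """Number of edges joining v to vertices of the cluster (v itself excluded)."""
    members = set(cluster) - {v}
    return sum(1 for e in edges if v in e and len(e & members) == 1)

def looks_significant(edges, vertices, v, cluster, rng):
    """Compare the links of v into C in G with those of its copy in a null model."""
    l1 = links_into(edges, v, cluster)
    null_edges = null_model(vertices, len(edges), rng)  # assumption: same edge count as G
    l2 = links_into(null_edges, v, cluster)  # copies keep the same vertex names
    return l1 > l2, l1, l2

# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
rng = random.Random(0)
edges = {frozenset(p) for p in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]}
significant, l1, l2 = looks_significant(edges, range(6), 0, {0, 1, 2}, rng)
```

A single random graph gives a noisy answer; averaging the second count over many null-model samples would be a natural refinement of this sketch.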

A community C can be associated with a score representing its quality; the score of a cluster C indicates to what extent
C contains vertices that have a high statistical significance with respect to it. The main idea of OSLOM is to progressively
add vertices to C and remove vertices from it so as to improve its score; this procedure is called clean-up.

The whole process introduced above is repeated several times starting from different nodes in order to explore different 
regions of G. This yields a final set of clusters that may overlap. 

The outcomes of our experimental tests are reported in Table 2. From the analysis of these results we
can draw some considerations about the performance of the proposed clustering method.

Considering these real-world scenarios, CONCLUDE outperforms all its competitors in terms of attained values of
network modularity. In general, the results provided by CONCLUDE are better than those provided by the Louvain
method by 5%-15% on average, with the improvement reaching 25% in the case of COPRA and OSLOM. This
advantage can be explained by two reasons. First, our strategy aims at maximizing the
network modularity of a weighted network, producing weights according to an intrinsic rationale driven by the ERW-Kpath
algorithm, whereas neither the Louvain method nor COPRA nor OSLOM (and, in general, most of the state-of-the-art
clustering algorithms IfTTl) is able to produce a weighting for an unweighted network.

In addition, and equally important, our strategy relies on both local and global information, an aspect which makes
CONCLUDE what recent literature calls a glocal optimization algorithm. In fact, the first step of our strategy
exploits long-range information by means of a random walker that, starting multiple times from each
node, visits not only the neighborhood but also those regions of the graph far from the origin of the walk. This global
information is exploited within the second step, the computation of the distance among all pairs of vertices. Finally,
local information is exploited by the modularity optimization strategy inspired by the Louvain method itself. The
final result is an improvement of the performance of the clustering procedure by a non-negligible factor, which
comes at almost no cost (in fact, the quality of the partition is shown to be very good by the values of NMI
provided in the previous experiment).



N.  Network         No. vertices  No. edges  CONCLUDE  LM     COPRA  OSLOM
1   CA-GrQc         5,242         28,980     0.883     0.860  0.761  0.696
2   CA-HepTh        9,877         51,971     0.806     0.772  0.768  0.653
3   CA-HepPh        12,008        237,010    0.760     0.656  0.754  0.675
4   CA-AstroPh      18,772        396,160    0.663     0.627  0.577  0.596
5   CA-CondMat      23,133        186,932    0.768     0.731  0.616  0.692
6   Facebook-links  63,731        1,545,684  0.664     0.626  0.726  0.391
7   SocialGraph     613,497       2,045,030  0.912     0.891  0.197  0.456

Table 2: Values of network modularity provided by CONCLUDE in the context of cluster detection from different real-world network datasets.
Our algorithm outperforms three state-of-the-art solutions, namely the Louvain method (LM), COPRA and OSLOM.
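
The relative differences discussed in the text can be recomputed directly from the values of Table 2. The following Python sketch does so for the Louvain method; the names `TABLE_2` and `relative_gain` are introduced here for illustration only, and the figures are those printed in the table.

```python
# Modularity values copied from Table 2: (CONCLUDE, LM, COPRA, OSLOM).
TABLE_2 = {
    "CA-GrQc":        (0.883, 0.860, 0.761, 0.696),
    "CA-HepTh":       (0.806, 0.772, 0.768, 0.653),
    "CA-HepPh":       (0.760, 0.656, 0.754, 0.675),
    "CA-AstroPh":     (0.663, 0.627, 0.577, 0.596),
    "CA-CondMat":     (0.768, 0.731, 0.616, 0.692),
    "Facebook-links": (0.664, 0.626, 0.726, 0.391),
    "SocialGraph":    (0.912, 0.891, 0.197, 0.456),
}

def relative_gain(ours, other):
    """Relative difference of `ours` with respect to `other`, in percent."""
    return 100.0 * (ours - other) / other

# Gain of CONCLUDE (column 0) over the Louvain method (column 1), per dataset.
gains_over_lm = {name: relative_gain(row[0], row[1]) for name, row in TABLE_2.items()}

for name, gain in gains_over_lm.items():
    print(f"{name:15s} {gain:5.1f}%")
```

Replacing `row[1]` with `row[2]` or `row[3]` gives the same comparison against COPRA or OSLOM.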

5.2. ERW-Kpath and κ-path edge centrality

In addition to discussing the performance of the clustering method, we here describe some empirical evidence on the
κ-path edge centrality measure as approximated by means of the ERW-Kpath algorithm, which is adopted by CONCLUDE
to rank edges. This is instrumental to understanding the functioning of the glocal optimization.

In particular, in the following we report an experiment aimed at discovering how different values of κ impact the
final edge centrality. To this purpose, we produce a probability distribution of κ-path values obtained by varying the
setting for κ, to understand its general behavior.

In detail, we consider the datasets presented in Table 2 separately and apply the ERW-Kpath algorithm varying the
κ-path length as κ = 5, 10, 20. After that, for a fixed value L of κ-path edge centrality, we compute the probability
Pr(L) of finding an edge with such a centrality value. The corresponding results are plotted in Figure 1 for the top four
largest datasets (namely, CA-HepPh, CA-AstroPh, CA-CondMat and Facebook-links). To show the scaling behavior
of the distributions, for each plot we adopt a logarithmic scale.
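
As an illustration of how Pr(L) can be obtained from a set of edge centrality values, the following Python sketch computes the fraction of edges sharing each value. The function name and the toy input are hypothetical. Real-valued centralities would usually be grouped into bins (for example logarithmic ones) before counting, a step that is assumed away here.

```python
from collections import Counter

def centrality_distribution(centrality):
    """Return {L: Pr(L)}, the fraction of edges whose centrality equals L.

    centrality: dict edge -> centrality value
    """
    if not centrality:
        raise ValueError("no edges given")
    counts = Counter(centrality.values())
    n_edges = len(centrality)
    return {value: count / n_edges for value, count in sorted(counts.items())}

# Hypothetical toy input: three edges share the lowest value, one edge stands out.
toy = {("a", "b"): 0.1, ("b", "c"): 0.1, ("c", "d"): 0.1, ("a", "c"): 0.4}
pr = centrality_distribution(toy)
```

Plotting the resulting pairs (L, Pr(L)) with both axes in logarithmic scale gives curves of the kind discussed in this section.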

The analysis of these plots highlights some relevant facts. First of all, a heavy-tailed distribution of the edge centrality
values emerges for all the different values of κ. In other words, if we use different values of κ, the centrality
indexes may change; however, as emerges from the plots, the curves representing κ-path centrality values resemble
parallel lines. This implies that, for a fixed value of κ, say κ = 5, an edge e will have a particular centrality score; but,
if κ is increased from 5 to 10 and, then, from 10 to 20, the centrality of e will always be increased by a constant factor.
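
The link between parallel curves and a constant factor can be made explicit under two assumptions that are not stated in the text. The first is that the tail of the distribution is well described by a power law with some exponent γ > 0 and prefactor a. The second is that moving from κ to κ' multiplies the centrality of every edge by the same constant c > 0. Reading Pr(L) as the fraction of edges with centrality L, a short sketch of the argument is:

```latex
\begin{align*}
\log \Pr\nolimits_{\kappa}(L)  &\approx \log a - \gamma \log L, \\
\Pr\nolimits_{\kappa'}(L)      &= \Pr\nolimits_{\kappa}(L / c), \\
\log \Pr\nolimits_{\kappa'}(L) &\approx \log a - \gamma \log (L / c)
                                = (\log a + \gamma \log c) - \gamma \log L .
\end{align*}
```

Under these assumptions both curves have slope -γ in a log-log plot and differ only in the intercept, which is what parallel lines amount to.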

This aspect reflects the ability of the ERW-Kpath algorithm to identify those edges which are in fact central in the
structure of the network, and to reward them with high weights. Intuitively, those edges which are less relevant
will also be rewarded, but a smaller number of times, which leads to lower values of centrality (and, in
the end, to the heavy-tailed distribution). These weights, which are computed according to a global rule (that is,
discovering by means of a random walker those edges which are more likely to be traversed during a process of
information spreading on the given network), are subsequently exploited to compute overall distances among pairs
of vertices. Finally, the distances, computed on the basis of global information, are exploited to identify the
network clusters according to local optima achieved by the greedy modularity optimization strategy. This summarizes
the glocal optimization nature of CONCLUDE.
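
The idea of rewarding edges each time a short random walk traverses them can be sketched as follows. This is a deliberately simplified illustration and not the ERW-Kpath algorithm itself. The function name, the uniform choice among unused incident edges, the fixed number of walks per vertex and the final normalization are all assumptions made here; the only properties taken from the text are that walks have finite length κ and start multiple times from each node.

```python
import random

def walk_edge_weights(neighbors, kappa, walks_per_vertex, rng):
    """Share of traversals received by each edge in short random walks.

    neighbors: dict vertex -> list of adjacent vertices (undirected, no self-loops)
    kappa:     maximum number of edges traversed by a single walk
    """
    # Every edge starts at zero so that never-visited edges still appear.
    counts = {frozenset((u, w)): 0 for u in neighbors for w in neighbors[u]}
    for start in neighbors:
        for _walk in range(walks_per_vertex):
            current, used = start, set()
            for _step in range(kappa):
                # Within one walk an edge is never traversed twice.
                options = [w for w in neighbors[current]
                           if frozenset((current, w)) not in used]
                if not options:
                    break  # dead end: all incident edges already used
                nxt = rng.choice(options)
                edge = frozenset((current, nxt))
                used.add(edge)
                counts[edge] += 1  # reward the traversed edge
                current = nxt
    total = sum(counts.values())
    return {e: (c / total if total else 0.0) for e, c in counts.items()}

# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
weights = walk_edge_weights(neighbors, kappa=5, walks_per_vertex=200,
                            rng=random.Random(42))
```

The resulting weights play the role of the edge centralities that are later turned into distances between vertices; how that mapping is done is outside the scope of this sketch.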



On a log-log scale, a distribution that resembles a straight line indicates scale-free behavior.
Figure 1: The probability distribution of κ-path edge centrality, computed according to different configurations of κ = 5, 10, 20 for the five largest
networks considered in this article: CA-HepPh, CA-AstroPh, CA-CondMat, Facebook-links and SocialGraph.
6. Conclusions



This article presents CONCLUDE, an efficient method for detecting clusters in complex networks which is shown to
work well in different domains. An early implementation of this algorithm has already been released, so that its strengths
and performance can be assessed independently by other authors.

Our ongoing research efforts focus on adopting CONCLUDE in several contexts. We are investigating: (i) the emergence
of a community structure in large online social networks such as Facebook [15][16];
(ii) the assessment of sociological conjectures that involve finding clusters according to the importance of edges, for
example the strength of weak ties theory [24]; and (iii) enhancing the performance of different state-of-the-art
clustering algorithms (such as COPRA [25] or OSLOM [30]) by pre-processing networks by means of a random-walk-based
measure of centrality like the κ-path edge centrality [11].

Finally, a long-term evaluation of our method is planned, in order to cover different domains of application:
for example, the application of CONCLUDE could be promising in the context of Neuroinformatics, applied to the
connectome (i.e., the human brain functional network) [33], or of Bioinformatics, to detect protein complexes in
protein-interaction networks [36].

Further extensions of CONCLUDE will be designed to face additional scientific challenges, such as the possibility of
discovering overlapping clusters. So far, our algorithm is able to produce a strong partition of the network, but does
not allow clusters to overlap each other. This feature will be instrumental in those contexts in which it is meaningful
for nodes to belong simultaneously to different clusters (for example, in the case of social networks, users might be part
of different communities at the same time).